Abstract. The instances of templates in Wikipedia form an interesting
data set of structured information. Here I focus on the cite journal
template, which is primarily used for citations to articles in scientific
journals. These citations can be extracted and analyzed: non-negative matrix
factorization is performed on an (article × journal) matrix, resulting in a
soft clustering of Wikipedia articles and scientific journals, with each
cluster more or less representing a scientific topic.



1 Introduction 

The category system and the use of templates in Wikipedia provide interesting
data sets of structured information. A number of reports have used the
category graph in automatic text processing, e.g., [1,2,3]. DBpedia stores
Wikipedia template information in a database, and associated Internet services
enable database-like queries [4]. I have previously reported results of a
relatively simple statistical analysis of a single Wikipedia template, the
cite journal template, counting the overall number of outbound scientific
citations and comparing it to the citation statistics of the Journal Citation
Reports from the company Thomson Scientific [5]. Other researchers have
considered more advanced statistical models in the form of multivariate
analysis [6,7]. They build a matrix from intrawiki links and submit it to
numerical algorithms. Here I take a similar approach but construct the matrix
from data associated with the scientific citation template rather than
wikilinks. The present work shows an example of how to perform multivariate
statistical analysis on the structured data in a wiki, and in this particular
case provides an overview of how science is represented in Wikipedia.

2 Method: From XML via matrices to topic visualization

A Perl script extracted the instances of the cite journal template from
bzipped XML files of the English Wikipedia downloaded from the Internet server
download.wikipedia.org. Another Perl script extracted the name of the journal
from the journal field in the template, and at the same time tried to match
the name to a 'canonical' journal name. For the matching, a small XML file
(originally built for the neuroinformatics Brede Database) listed the
canonical name and variations of the name for, so far, 255 different journals.
For example, the entry for the journal with the canonical name Proceedings of
the Royal Society of London, Series B, Biological Sciences listed 12 other
variations of the name, including the PubMed abbreviation Proc R Soc Lond B
Biol Sci. These 255 journals comprise a large part of the journals most cited
from Wikipedia, and thus the script normalizes very many citations to a
canonical name, but far from all name variations of less-cited journals are
resolved. A number of other issues prevent the databasing of the citations
from being particularly exact. Special cases of journal naming make it hard to
match all journal names with a canonical journal name, e.g., Mutation Research
is actually three (or four) different journals with respect to ISSN. Cited
'journals' may not be scientific journals but, e.g., newspapers. Citations
that occur multiple times in the same Wikipedia article to the same item (by
the <ref name="anchor"/> construct) were counted only once.
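
As an illustration only, the extraction and name matching described above could be sketched in Python as follows. The original work used Perl scripts that are not shown here, so the regular expressions, the function names and the small CANONICAL lookup table are assumptions of this sketch. It handles only templates without nested templates and journal fields without piped wikilinks.

```python
import re

# Hypothetical lookup table: lower-cased name variant -> canonical journal name.
CANONICAL = {
    "proc r soc lond b biol sci":
        "Proceedings of the Royal Society of London, Series B, Biological Sciences",
}

# Matches a {{cite journal ...}} template that contains no nested templates.
TEMPLATE_RE = re.compile(r"\{\{\s*cite journal\s*\|([^{}]*)\}\}", re.IGNORECASE)
# Matches the value of the journal field up to the next field separator.
JOURNAL_RE = re.compile(r"(?:^|\|)\s*journal\s*=\s*([^|]*)", re.IGNORECASE)


def normalize(name):
    """Strip simple wiki markup and map known variants to the canonical name."""
    cleaned = re.sub(r"\[\[|\]\]|''+", "", name).strip()
    return CANONICAL.get(cleaned.lower(), cleaned)


def cited_journals(wikitext):
    """Return the journal of each cite journal template in one article.

    A reused reference such as <ref name="anchor"/> contains no template,
    so it adds nothing. Skipping verbatim-duplicate templates via `seen`
    is an extra assumption of this sketch.
    """
    journals = []
    seen = set()
    for match in TEMPLATE_RE.finditer(wikitext):
        body = match.group(1)
        if body in seen:
            continue
        seen.add(body)
        field = JOURNAL_RE.search(body)
        if field and field.group(1).strip():
            journals.append(normalize(field.group(1)))
    return journals
```

Names that are absent from the lookup table pass through unchanged, which mirrors the situation described above where the variations of less-cited journals remain unresolved.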

An (article × journal) data matrix is built in which each column corresponds
either to a canonical journal name or to the journal name as written in the
citation in the Wikipedia article. Each row corresponds to a Wikipedia article.
The (i, j) element of the matrix is set to the number of times the ith article
cites the jth journal. Most of the elements in the matrix are zero.
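
A minimal sketch of such a matrix, assuming the citations are already available as a mapping from article title to the list of cited journal names. Because most elements are zero, only the non-zero counts are stored. The function name and the dictionary-based representation are illustrative choices, not the representation used in the original analysis.

```python
from collections import Counter


def build_matrix(citations):
    """Build a sparse (article x journal) count matrix.

    `citations` maps an article title to the list of journals it cites.
    Returns (articles, journals, counts), where counts[(i, j)] is the number
    of times article i cites journal j. Absent keys stand for zeros.
    """
    articles = sorted(citations)
    journals = sorted({name for cited in citations.values() for name in cited})
    column = {name: j for j, name in enumerate(journals)}
    counts = Counter()
    for i, article in enumerate(articles):
        for name in citations[article]:
            counts[(i, column[name])] += 1
    return articles, journals, counts
```

The sorted row and column lists serve as the index-to-name maps needed later to label the loadings in the factorized matrices.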

Clustering of the constructed matrix is performed with the multiplicative
update rules of non-negative matrix factorization (NMF) as put forward by
Lee and Seung [9]. The algorithm for the 'Euclidean distance' runs for 50,000
iterations. This particular multivariate analysis resembles several other
methods, such as the one used by Buntine in his Wikipedia analysis [6] as well
as the analysis of Bellomi and coworkers [7]. NMF splits the data matrix
X(article × journal) into three other matrices, X = WH + U. Whereas U is just
the residual matrix, the factorized matrices W and H are the interesting ones,
which may be expected to represent specific scientific topics characterized by
their citation patterns: a specific column in W can be interpreted as
containing the loadings of articles on a specific 'topic' that the cluster
represents, and a specific row in H contains the loadings for journals on that
topic. NMF does not result in a hard clustering, where each item is assigned
exclusively to one cluster, but rather in a soft two-way clustering. Using the
Kleinberg terminology [10], the W matrix contains loadings for Wikipedia 'hub'
articles, whereas H contains loadings for 'authoritative' journals. One
advantage of Lee and Seung's 'Euclidean distance' version of the NMF algorithm
is that no multiplications take place with the full reconstructed data matrix,
i.e., the product matrix WH. This is in contrast to the 'divergence' version,
which in my implementation is much slower and uses more memory for these kinds
of data sets.
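
The multiplicative updates for the 'Euclidean distance' objective can be sketched as below. This is a small dense pure-Python illustration, not the Matlab implementation used for the results. The helper names, the random initialization, the small constant eps guarding against division by zero and the default iteration count are assumptions of the sketch. Note how the denominators are grouped as (W'W)H and W(HH'), so the product WH is never formed.

```python
import random


def matmul(a, b):
    """Product of two dense matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    """Transpose of a dense matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def nmf(x, k, iterations=200, seed=0, eps=1e-9):
    """Factorize a non-negative matrix x (n x m) into w (n x k) and h (k x m)."""
    rng = random.Random(seed)
    n, m = len(x), len(x[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W'X) / ((W'W) H): only k x k and k x m products are formed.
        wt = transpose(w)
        num = matmul(wt, x)
        den = matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (XH') / (W (HH')): again WH itself is never built.
        ht = transpose(h)
        num = matmul(x, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h
```

For matrices of the size considered here, a sparse representation of X would be needed in the products W'X and XH'. The dense lists are used only to keep the example short.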

Fig. 1. Number of outbound scientific citations as counted from the use of the
cite journal template for different dumps of the English Wikipedia. A sharp
rise is seen from the 2007 dumps to the 2008 dump due to citations added by a
bot.

The initialization of the NMF algorithm requires the specification of the
number of clusters, i.e., the number of columns in the W matrix and the number
of rows in the H matrix. I run the NMF algorithm with different numbers of
clusters, from one to twenty. Each run is independent of the others, so they
can be run in parallel on a computer cluster. Many results appear when running
the NMF with different numbers of clusters, and a so-called 'cluster bush'
visualization can be used to get an overview of the relationship between the
different clusterings [11]. In this kind of plot each cluster is rendered as a
circle, and the amount of overlap between two clusters is indicated by the
thickness of a line.
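
The text does not specify how the overlap between two clusters is measured. One simple possibility, shown purely as an assumption, is to assign each article to the cluster with its largest loading in W and to count the articles shared between a cluster in one run and a cluster in the next. The function names are hypothetical and the measure used by the Brede Toolbox may differ.

```python
from collections import Counter


def hard_assign(w):
    """Assign each row (article) of w to the column with its largest loading."""
    return [max(range(len(row)), key=row.__getitem__) for row in w]


def overlaps(w_small, w_large):
    """Count articles shared by each pair of clusters from two NMF runs.

    Returns a Counter mapping (cluster in first run, cluster in second run)
    to the number of articles assigned to both. Under the assumption made
    here, this count would determine the thickness of the connecting line.
    """
    return Counter(zip(hard_assign(w_small), hard_assign(w_large)))
```

Applying `overlaps` to the W matrices of consecutive runs (one and two clusters, two and three clusters, and so on) yields the edges between adjacent rows of the plot.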

The NMF algorithm is run and the cluster bush visualization is made in 
Matlab with functions from the Brede Toolbox [12]. 

3 Results 

Examining the full count of scientific citations from Wikipedia, a marked
increase becomes apparent from 2007 to the examined dump of March 2008, see
Figure 1: from 74,776 citations in the October 2007 dump to 228,593 in the
March 2008 dump.

Whereas astronomy journals received comparatively many citations from
Wikipedia in the 2007 dumps, and journals such as The Journal of Biological
Chemistry had relatively few citations when compared to the Journal Citation
Reports, this citation pattern has now changed considerably: Wikipedians have
constructed the bot ProteinBoxBot, which automatically builds infoboxes and
citations in Wikipedia articles. Thus a very large number of citations to
protein/gene work has been added, and with the March 2008 dump The Journal of
Biological Chemistry is the most cited journal, see Table 1. Scientific
articles also cite this journal the most, according to Journal Citation
Reports.

Citations  Journal name

    16739  The Journal of Biological Chemistry
    12779  PNAS
     8772  Genome Research
     7561  Nature
     4007  Nature Genetics
     3928  Genomics
     3689  Science
     3511  Gene
     3380  Biochemical and Biophysical Research Communications
     3043  Molecular and Cellular Biology
     2975  Cell
     2261  The EMBO Journal

Table 1. Most cited journals from Wikipedia in the 12th March 2008 dump.

The conversion of the information in the templates to a matrix representation
results in matrices of size (23595 × 18194) and (43073 × 23096) for the
October 2007 and March 2008 dumps, respectively. The densities of the
constructed matrices are 0.01%-0.02%, depending on the dump version of
Wikipedia. The number of columns in the matrices would have been smaller and
the density higher if the matching of journal names had been more complete.
The number of articles using the cite journal template almost doubled in less
than half a year between the two dumps. This increase is likely due to the
large number of articles added for proteins/genes by ProteinBoxBot. Many of
these articles have no text other than that added by the bot, and the
citations are not in-text citations.
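
To give a feeling for these numbers, the following short calculation converts the reported matrix sizes and the 0.01%-0.02% density range into approximate counts of non-zero elements. Only the shapes and the density range come from the text. The function name is illustrative, and the exact number of non-zero elements per dump is not reported, so only the bounds implied by the range are computed.

```python
# Matrix shapes reported in the text: (articles, journals).
shapes = {"October 2007": (23595, 18194), "March 2008": (43073, 23096)}


def nonzeros_for_density(shape, density_percent):
    """Number of non-zero elements implied by a density given in percent."""
    rows, cols = shape
    return rows * cols * density_percent / 100.0


for dump, shape in shapes.items():
    low = nonzeros_for_density(shape, 0.01)
    high = nonzeros_for_density(shape, 0.02)
    print(f"{dump}: {shape[0] * shape[1]:,} elements, "
          f"{low:,.0f} to {high:,.0f} non-zeros at 0.01%-0.02% density")
```

The October 2007 matrix has about 429 million elements and the March 2008 matrix about 995 million, so storing only the non-zero elements reduces the memory requirement by several orders of magnitude.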

Figure 2 displays a cluster bush visualization of the NMF results for the
October 2007 dump; for clarity only the NMF runs with one to seven clusters
are shown. The bottom row displays the run with just one cluster, where the
Wikipedia articles List of molecules in interstellar space and Extinction
(astronomy) are the largest hubs. The Astrophysical Journal and Astronomy &
Astrophysics are the largest authoritative journals for this astrophysical
cluster. As the NMF model size increases, i.e., as more clusters are added,
this topic continues to be a cluster of its own. The new clusters that arise
are related to medical sciences, intelligence, human leukocyte antigen and
bacteria. For these runs of NMF the columns corresponding to the
cross-disciplinary journals Nature, Science and PNAS were excluded.

With the present algorithm a few of the clusters represent very restricted
topics, e.g., in one case the article Henry George Fourcade and the journal The
Photogrammetric Record constituted a single cluster. Another cluster that is
also dominated by single items has the article about the group of genes Solute
carrier family and the journal Pflügers Archiv European Journal of Physiology.

Cluster       Wikipedia hub articles   Authoritative journals

'Cancer'      RBL2                     Oncogene
              MYB                      Cancer Research
              ERG (gene)               Int. J. Cancer
              EPS8                     Genes & Development

'Immunology'  DNA vaccination          The Journal of Immunology
              CCL21                    The Journal of Experimental Medicine
              HLA-DQ8                  Tissue Antigens
              HLA-DQA1                 Eur. J. Immunol.

'Blood'       Acute myeloid leukemia   Blood
              Serpin                   British Journal of Haematology
              CEBPE                    The Journal of Clinical Investigation
              CD34                     The Journal of Experimental Medicine

'Virology'    Papillomavirus           The Journal of Virology
              HHV Infected Cell . . .  Virology
              Poliovirus               Journal of Molecular Biology
              RELB                     AIDS Res. Hum. Retroviruses

Table 2. The top Wikipedia hub articles and authoritative journals with respect
to clusters from a non-negative matrix factorization with twenty clusters.

Applying NMF to the March 2008 dump results in components that are
overwhelmingly affected by the large number of citations in the protein/gene
articles. A run of NMF with twenty clusters resulted in only three clusters
that did not exhibit an association with genes: one cluster centered around
solar system astronomy, with the journal Icarus as the primary authoritative
journal and Uranus as the top hub Wikipedia article; another cluster centered
around The Astrophysical Journal; and a third, medical cluster with New England
Journal of Medicine and The Lancet as top authorities and Myocardial infarction
as the Wikipedia hub article. The remaining seventeen clusters were all related
to proteins and genes or other closely related topics within biology and
biochemistry. Many of these clusters are mostly driven by a single journal,
i.e., a single element in each row of the H matrix is much larger than the rest
of the elements, whereas the W matrix shows a much more even loading over the
Wikipedia articles within each cluster. For example, one cluster interpretable
as a 'virology' cluster has The Journal of Virology as the dominating
authoritative journal.
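
One way to quantify how strongly a cluster is driven by a single journal is to compare the largest loading in a row of H with the second largest. The ratio used below is an illustrative measure chosen for this sketch, not one reported in the analysis, and the function name is hypothetical. The sketch assumes at least two journals.

```python
def dominance(h, journals):
    """For each cluster (row of h), return the top journal and the ratio of
    its loading to the second-largest loading in that row."""
    result = []
    for row in h:
        order = sorted(range(len(row)), key=row.__getitem__, reverse=True)
        top, second = row[order[0]], row[order[1]]
        ratio = top / second if second > 0 else float("inf")
        result.append((journals[order[0]], ratio))
    return result
```

A large ratio marks a cluster such as the 'virology' one, where a single journal dominates. A ratio near one indicates that several journals share the topic.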

A few examples of items in a sample of clusters from an NMF run with 
twenty clusters are shown in Table 2. These kinds of results may be written to
an HTML page and put on the web to serve as an online overview of how science 
is cited from Wikipedia. 
Fig. 2. 'Cluster bush' visualization of results for non-negative matrix
factorization (NMF) of the scientific citations in the 18th October 2007 dump
of the English Wikipedia. Each circle denotes a cluster. The lowest row
displays the results of an NMF run with one cluster, the second lowest row the
results for NMF with two clusters, etc. The text on the nodes gives the
Wikipedia articles that are associated with high loadings in the factorized
matrices of the NMF. The lines between the nodes indicate how much the
clusters overlap.



